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1 Preliminaries 

Matching is important class of combinatorial optimization problems with many real-life appli- 
cations. Matching problems involve choosing a subset of edges of a graph subject to degree 
constraints on the vertices. Many alignment problems arising in computational biology are special 
cases of matching in bipartite graphs. In these problems the vertices of the graph can be nu- 
cleotides of a DNA sequence, aminoacids of a protein sequence or secondary structure elements of 
a protein structure. Unlike classical matching problems, alignment problems have intrinsic order 
on the graph vertices and this implies extra constraints on the edges. As an example, Fig. shows 
an alignment of two sequences as a matching in bipartite graph. We can see that the feasible 
alignments are 1-matchings without crossing edges. 

In this paper we deal with the problem of aligning a protein structure template to a query 
protein sequence of length N, known as protein threading problem. A template is an ordered 
set of m secondary structure elements (or blocks) of lengths i = 1, . . . ,m. An alignment (or 
threading) is covering of contiguous sequence areas by the blocks. A threading is called feasible if 
the blocks preserve their order and do not overlap. A threading is completely determined by the 
starting positions of all blocks. For the sake of simplicity we will use relative positions. If block i 
starts at the jth query character, its relative position is n — j — Y?k=i h- In this way the possible 
(relative) positions of each segment are between 1 and n = N + 1 — Y^iLi h ( see Fig. Gib)). The 
set of feasible threadings is 



Protein threading problem is a matching problem in a bipartite graph (U LiV,U xV), where 
U = {m, . . • , u m } is the ordered set of blocks and V = {v\, . . . , v n } is the ordered set of relative 
positions. The threading feasibility conditions can be restated in terms of matching in the following 
way. A matching M C U x V is feasible if: 

i d{u) = 1, u € U (where d(x) is the degree of x). This means that each block is assigned to 
exactly one position) . By the way this implies that the cardinality of each feasible matching 
is m. 

ii There are no crossing edges, or more precisely, if (ui,Vj) £ M, (uk,vi) £ M and i < k, then 
j < I. This means that the blocks preserve their order and do not overlap. The last inequality 
is not strict because of using relative positions. 



T = {(n, . . . ,r m ) | 1 < ri < ■ ■ ■ < r m < n}. 
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Figure 1: Matching interpretation of sequence alignment problem 
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Figure 2: (a) Example of alignment of query sequence of length 20 and template containing 3 
segments of lengths 3, 5 and 4. (b) Correspondence between absolute and relative block positions, 
(c) A matching corresponding to the alignment of (a) . 



2 



position 




i=l i = 2 i = 3 i = 4 i = 5 i = 6 block 



Figure 3: Example of alignment graph. The path in thick lines corresponds to the threading in 
which the positions of the blocks are 1,2,2,3,4,4. 



Note that while (i) is a classical matching constraint, (ii) is specific for the alignment problems 
and makes them more difficult. Fig. [2c) shows a matching corresponding to a feasible threading. 

An alternative approach is to define the relative positions as r, = j — YX=i h + i — 1- In this 
case the relative positions of the feasible threadings are related by 

1 < n < ■ • • < r m < m + n — 1 

and the feasible threadings are 1-matchings without crossing edges. A threading is determined by 
choosing m out of m + n — 1 positions and the number of feasible threadings is |T| = ( m4 ^ -1 )- 

One of the possible ways to deal with alignment problems is to try to adapt the existing 
matching techniques to the new edge constraints of type (ii) . Instead of doing this we propose a 
new graph model with some nice properties allowing to develop efficient algorithms on the top of 
it. 

We introduce an alignment graph G = (U xV,E). Each vertex of this graph corresponds to 
an edge of the matching graph. For simplicity we will denote the vertices by Uy , i = 1, . . . , m, 
j = 1, . . . , n and draw them as an n x m grid (see Fig. EJ) - One can connect by edges the pairs of 
vertices of G which correspond to pairs of noncrossing edges in the matching graph. In this case 
a feasible threading is an m-clique in G. A similar approach is used in [?]. We introduce only a 
subset of the above edges, namely the ones that connect vertices from adjacent columns and have 
the following regular pattern: E = {(Vij, | i = 1, • ■ • , m — 1, 1 < j < I < n). We add two 

more vertices S and T and edges connecting S to all vertices from the first column and T to all 
vertices from the last column. Now it is easy to see the one-to-one correspondence between the 
set of feasible threadings (or matchings) and the set of S-T paths in G. Fig. illustrates this 
correspondence. 

Till now we gave several alternative ways to describe the feasible alignments. Alignment 
problems in computational biology involve choosing the best of them based on some score function. 
The simplest score functions associate weights to the edges of the matching graph. For example, 
this is the case of sequence alignment problems. By introducing alignment graphs similar to the 
above, classical sequence alignment algorithms, such as Smith- Waterman or Needleman-Wunch, 
can be viewed as finding shortest S-T paths. When the score functions use structural information, 
the problems are more difficult and the shortest path model cannot incorporate this information. 

The score functions in PTP evaluate the degree of compatibility between the sequence amino 
acids and their positions in the template blocks. The interactions (or links) between the template 
blocks are described by the so-called generalized contact map graph, whose vertices are the blocks 
and whose edges connect pairs of interacting blocks. Let L be the set of these edges: 

L = {(i, k) | i < k and blocks i and k interact} 

Sometimes we need to distinguish the links between adjacent blocks and the other links. Let 
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i = l i = 2 i = 3 i = 4 i = 5 i = 6 block 




Figure 4: Example of augmented path. The generalized contact map graph is given in the bottom. 
The x arcs of the S-T path are in solid lines. The activated z-arcs are in dashed lines. The length 
of the augmented path is equal to the score of the threading (1, 2, 2, 3, 4, 4). 

R = {(i, k) | (i, k) G L, k — i > 1} be the set of remote (or non-local) links. The links from L\R 
are called local links. Without loss of generality we can suppose that all pairs of adjacent blocks 
interact. 

The links between the blocks generate scores which depend on the block positions. In this way 
a score function of PTP can be presented by the following sets of coefficients 

• Cij, i = 1, . . . , m, j = 1, . . . , n, the score of putting block i on position j 

• dijki, (i, k) e L, 1 < j < I < n, the score generated by the interaction between blocks i and 
k when block i is on position j and block k is on position I. 

The coefficients cy are some function (usually sum) of the preferences of each query amino acid 
placed in block i for occupying its assigned position, as well as the scores of pairwise interactions 
between amino acids belonging to block i. The coefficients dijki include the scores of interactions 
between pairs of amino acids belonging to blocks i and j. Loops may also have sequence specific 
scores, included in the coefficients dij^+ij. 

The score of a threading is the sum of the corresponding score coefficients and PTP is the 
optimization problem of finding the threading of minimum score. If there are no remote links (if 
R = 0) we can put the score coefficients on the vertices and the edges of the alignment graph and 
PTP is equivalent to the problem of finding the shortest S-T path. In order to take the remote 
links into account, we introduce the edges 

{(vij,v kl ) \(i,k) eR, l<j<l<n} 

which we will refer as z-edges. 

An S-T path is said to activate the z-edges that have both ends on this path. Each S-T path 
activates exactly \R\ z-edges, one for each link in R. The subgraph induced by the edges of an 
S-T path and the activated z-edges is called augmented path. Thus PTP is equivalent to finding 
the shortest augmented path in the alignment graph. See Fig. 

As we will see later, the main advantage of this graph is that some simple alignment problems 
reduce to finding the shortest S-T path in it with some prices associated to the edges and/or 
vertices. The last problem can be easily solved by a trivial dynamic programming algorithm of 
complexity 0(mn 2 ). In order to address the general case we need to represent it from graph- 
theoretical language to an integer programming problem. 
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2 Integer programming formulation 



In the PTP network formulation the vertices (i,j),j = l,...,n will be referred to as ith layer. Each 
layer corresponds to a structural element, and each vertex in a layer corresponds to a positioning 
of this element on a query protein. Let L Q {(i,k) \ 1 < i < k < m} be the set of inter-layers 
interactions. The graph with vertices {1, . . . , to} and edges L is called generalized contact map 
graph: a link between layers i and k means that the corresponding structural elements are in 
contact in the 3D structure. 

Let yij be binary variables associated to the vertices of G. y^ is one if block i is on position j 
and zero otherwise. Let A ik be the 2nx node-arc incidence matrix for the subgraph spanned 

by the layers i and k, (i, k) s L. The submatrix A\, the first n rows of A^, (resp. A\, the last 
n rows) corresponds to the layer i (resp. k). To avoid added notation we will use vector notation 
for the variables yi = {yn, ...yi n ) G B n where B n is the set of n-dimensional binary vectors, with 
assigned costs a = {cn,...c m ) £ R n and z ik = (z»ifci, • . . , znkn, z&ki, ■ ■ ■ , Zinkn) G B ^~ for 
(i, k) G L with assigned costs d ik = (diifci, ■ ■ ■ , d akn , d i2 ki, ■ ■ ■ , d inkn ) G R a . In the sections 
below the vector di k will be considered as a n x n upper triangular matrix, having arbitrarily large 
coefficients below the diagonal. This slight deviation from the standard definition of an upper 
triangular matrix is used only for formal definition of some matrix operations. 

Now the protein threading problem PTP(L) is defined as: 

m 

zf P = v(PTP(L)) = mm{^2c t y t + ^ d ^ lk ] (1) 

i=l (i,k)£L 

subject to: y = (y%, . . . , y m ) G Y, (2) 

y t =A\z lk (i,k)eL (3) 

yk = A\z lk (i,k)eL (4) 

z ik e B 11 ^ 11 (i,k)eL (5) 

where Y is defined by (SJ-©- The shortcut notation v(.) will be used for the optimal objective 
function value of a subproblem obtained from PTP(L) with some z variables fixed. 



3 Complexity results 

In this section we study the structure of the polytope defined by I©-© and zi k G R + 2 , as 
well the impact of the set L on the complexity of the algorithms for solving the PTP problem. 
Throughout this section, vertex costs c, are assumed to be zero. An edge in the contact graph 
will be called local if it is generated by a neighboring inter-layer interaction (i.e. a link (i, i + 
is presented in the set L). Here below we present four sorts of contact graph that make PTP 
polynomially solvable. 



3.1 Contact graph contains only local edges 

An important characteristic of the alignment graph in this case is that it has a tight LP description. 
Let Y be the set defined by the constraints 

n 

^2/y = l i = l,...,m (6) 
j=i 

i j 

~ ^Vi+t.i > i = l,...,m-l, j = l, ...,n-l (7) 
i=i i=i 

Vij>0 i = l,...,m,j = l,...,n (8) 
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Constraints El ensure the feasibility condition (i) and are responsible for (ii) . That is why 
Y (~l B mn is exactly the set of feasible threadings. The following proposition shows an important 
and desirable property of the polytope Y . 

Proposition 1 The polytope Y is integral, i.e. it has only integer-valued vertices. 

Proof: Let A be the matrix of the coefficients in Q, © with the columns numbered by 
the indices of the variables. One can prove that A is totaly unimodular (TU) by performing the 
following sequence of TU preserving transformations. 

for j = 1, m 

delete column (j, n) ( these are unit columns ) 

for k = 1, to 
for j = n — 1 , 1 

pivot on akj (A is TU iff the matrix obtained by a pivot operation on A is TU) 

delete column (k,j) ( this is now unit column ) 

The final matrix is an unit column that is TU. The assertion follows from A is TU. One could prove 
the same assertion by showing that an arbitrary feasible solution to ©-(0) is a convex combination 
of some integer-valued vertices of Y. The best such vertex (in the sense of an objective function) 
might be a good approximate solution to a problem whose feasible set is an intersection of Y with 
additional constraints. 

Let y is an arbitrary non -integer solution to Because of ©, © an unit flow 1 / = 

(fsj, f(i,k){i+i,j)) i = 1> m - 1 j = l,n in G exist s.t. 

^2 f(i,k)(i+i,j) =Vij £ = 1, m - 1 f sj = yij j = 1, n 

k<j 

By the well known properties of the network flow polytope, the flow / can be expressed as a 
convex combination of integer- valued unit flows (paths in G). But each such flow corresponds 
to an integer-valued y, i.e. yij = f(i-i,k)(ij) — 1- Thus, the convex combination of the paths that 
gives / is equivalent to a convex combination of the respective vertices of Y that gives y. 

The details for efficiently finding of the set of the vertices participating in the convex combi- 
nation could be easily stressed by this sketch of the prove. ■ 

3.2 Contact graph contains no crossing edges 

Two links (ii,k%) and (£2,^2) such that i\ < £2 are said to be crossing when k\ is in the open 
interval (£2, fe)- The case when the contact graph L contains no crossing edges has been mentioned 
to be polynomially solvable for the first time in [lj. Here we present a different sketch for 0(n 3 ) 
complexity of PTP in this case. 

If L contains no crossing edges, then PTP(L) can be recursively divided into independent 
subproblems. Each of them consists in computing all shortest paths between the vertices of two 
layers i and k, discarding links that are not included in (£, k). Thus the result of this computation 
is a distance matrix such that Dik(j,l) is the optimal length between vertices and (k,l). 
Note that for j > I, as there is no path in the graph, Dik(j,l) is an arbitrarily large coefficient. 
Finally, the solution of PTP(L) is the smallest entry of D\ m . 

We say that an edge (£,fc),£ < k is included in the interval [a, b] when [i,k] C [a, b}. Let us 
denote by L/^ the set of edges of L included in [£, k]. Then, an algorithm to compute can be 
sketched as follows: 

lr The 4 indeces i,k,p,j used for arcs labeling follows the convention: tail at vertex (i, k) head at vertex (p,j). 
Sometimes the brackets will be dropped. 
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1. if L(jfe) = {(i, k)} then the distance matrix is given by 

_ ( d ik if(*,&) G I , Q * 
Jfc ~ \ otherwise W 

where is an upper triangular matrix in the previously defined sense (arbitrary large coef- 
ficients below the main diagonal) and having only zeros in its upper part. 

2. otherwise as Luk) h as no crossing edges, there exists some s G [i, k] such that any edge of 
L(jfe) but (i,k) is included either in [i,s] or in [s,k]. Then 

_ j D is .D s k + d ik if{i,k)eL 
lk ~ \ D ls .D sk otherwise 1 > 

where the matrix multiplication is computed by replacing (+, x) operations on reals by 
(min, +). 

Remark 1 If the contact graph has m vertices, and contains no crossing edges, then the problem 
is decomposed into 0{m) subproblems. For each of them, the computation of the corresponding 
distance matrix is a 0(n 3 ) procedure (matrix multiplication with (min, +) operations). Overall 
complexity is thus 0(mn 3 ). Typically, n is one or two orders of magnitude greater than m, and 
in practice, this special case is already expensive to solve. 



3.3 Contact graph is a single star 

A set of edges L = k\), . . . , (i r , k r )}, k\ < fc 2 < . . . k r is called a star 2 if it has at least two 
elements and i t = i\, t < r. The arc costs corresponding to the link (i, k s ) are given by the upper 
triangular matrix di ka ■ 

The following algebra is used to prove the 0{n 2 ) complexity of the corresponding PTP. 

Def. 1 Let A, B be two matrices of size n xn. M — A® B is defined by M(i, j) = min A(i, r) + 

i<r<j 

B(i,j) 

In order to compute A eg) B, we use the following recursion: let M' be the matrix defined by 
M'(i,j) = min A(i,r), then 

i<r<j 

M'(i,j) = mm{M'(i,j - 1), A(i,j)}, forallj > i 

Finally A (gi B — M' + B. From this it is clear that ® multiplication for n x n matrices is of 
complexity 0(n 2 ). 

Theorem 1 Let L = {(i, fci), ...,(«, k r )} be a star. Then Dik r = (. . . {dik 1 ® dik 2 ) <£> . . . ) dik r 

Proof: The proof follows the basic dynamic programming recursion for this particular case: for 
the star L = {(i, ki), k r )} = L' \J{{i, k r )}, we have v(L : z ijkr i = 1) = d ljkr i + min v(L' : 

j<s<l 

Zijk r _is = 1) B 



3.4 Contact graph is a sequence of independent subproblems 

Given a contact graph L = k\), . . . , (i r , k r )}, PTP(L) can be decomposed into two indepen- 
dent subproblems when there exists an integer e G (1, m) such that any edge of L is included either 
in [1, e], either in [e, m]. Let / = {ii, . . . , i s } be an ordered set of indices, such that any element of 
/ allows for a decomposition of PTP(L) into two independent subproblems. Suppose additionally 
that for all t < s — 1, one is able to compute D itit+1 . Then we have the following theorem: 

2 This definition corresponds to the case when all edges have their left end tied to a common vertex. Star can 
be symmetrically defined: i.e. all edges have their right end tied to a common vertex. All proofs require minor 
modification to fit this case. 
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Theorem 2 Let p = (pi,P2, ■ ■ ■ ,Pn) be obtained by the following matrix-vector multiplication 
P = Di t i 2 Di 2 i 3 . . . Di s _ 1 i 3 p, where p = (0,0,..., 0) and the scalar product in the matrix-vector 
multiplication is defined by changing "+" with "min" and "." with "+". Then for all i, pi = 
v{PTP(L : y u = I)), and v{PTP(L)) = minfe}. 

Proof: Each multiplication by D ikik+1 in the definition of p is an algebraic restatement of the 
main step of the algorithm for solving the shortest path problem in a graph without circuits. ■ 

Remark 2 With the notations introduced above, the complexity of PTP(L) for a sequence of such 
subproblems is 0(sn 2 ) plus the cost of computing matrices Dj t ,- t+1 , 

From the last two special cases, it can be seen that if the contact graph can be decomposed 
into independent subsets, and if these subsets are single edges or stars, then there is a 0(srn 2 ) 
algorithm, where s is the cardinality of the decomposition, and r the maximal cardinality of each 
subset, that solves the corresponding PTP. 

Remark 3 As a corollary from proposition^] we can easily derive than when L is cross free and 
does not contain stars, the polytope defined by and Zik 6 R + 2 is integer-valued. 

3.5 About the threading polytope 

All of the preceeding polynomiality results were derived without any refering to the LP relaxation 
of CO)-©. The reason is that even for a rather simple version of the graph L the polytope P yz 
defined by 0-© with © changed to : 0< < 1 is non-integral.To prove this, let's call the 
convex hull of the feasible points of IS)-© .denoted as P yz a threading polytope. We have already 
seen (indirectly) that if L contains local links only then the threading polytope is P yz . Recall 
the one-to-one correspondence between the threadings, defined as points in Y and the paths in 
graph G.When L = {(i,i + = l,m— 1} (this could be always achieved by adding zero costs 
local z arcs) then (2J-© are linear description of a unit flow in G that is an integral polytope. 
Unfortunately, this happened to be a nesessary condition also. 

Theorem 3 Let n > 3 and L contains all local links. Then P yz — P yz if and only if R = 
Proof: (=>■) 

W.l.o.g. we can take R = (1,3) and m = 3, n = 3. Then the point A = (yu = j/12 = j/21 = 
2/22 = 0.5,j/32 = 0.75, 2/33 = 0.25, z 112 \ = z 2 i 3 2 = 24222 = ^1232 = 0.5,z 223 2 = Z2233 = 21132 = 
21133 = 0.25, ) € P yz and the only eligible (whose convex hull could possibly contains A) integer- 
valued vertices of P yz are B = (y u = y 2 i = 2/32 = 21132 = 1) and C = {y 12 = 2/22 = 2/32 = 21232 = 
1) but A is not in the segment [B,C]. The generalization of this proof for arbitrary to, n > 3 
and R is almost straighforward. 

(<=) follows directly from^] ■ 
This is a kind of negative result seting a limit for relaying on LP to ??? 



4 Lagrangian relaxation 

Consider an integer program 

Zip = min{ca; : x € S}, whereS 1 = {x G Z" : Ax < b} (11) 

Relaxation and duality are the two main ways of determining Zip and upper bounds for zjp. 
The linear programming relaxation is obtained by changing the constraint x £ Z™ in the def- 
inition of S by x > 0. The Lagrangian relaxation is very convenient for problems where the 
constraints can be partitioned into a set of 'simple' ones and a set of 'complicated' ones. Let us 
assume for example that the complicated constraints are given by A 1 x < b 1 , where A 1 is m x n 
matrix, while the nice constraints are given by A 2 x < b 2 . Then for any A € i?™ the problem 
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zlrM = min-fca; + Affo 1 — A 1 x)\, where Q = {x e Z? : ^4 2 £ < 6 2 } is the Lagrangian relaxation 

x£Q T ' 

of dJ, i.e. zlr(X) < zip for each A > 0. The best bound can be obtained by solving the 
Lagrangian dual zld = maxzLR(X). It is well known that relations Zip > zld > zlp hold. 

A>0 

An even better relaxation, called cost- splitting, can be obtained by applying Lagrangian duality 
to the reformulation of ifTDl given by 

Zip = min cx 1 (12) 



subject to: A 1 x 1 < b 1 , A 2 x 2 leqb 2 , (13) 
x 1 - x 2 = (14) 

x 1 eZl, x 2 e Z%, (15) 

Taking x 1 — x 2 = as the complicated constraint, we obtain the Lagrangian dual of l(T2l) - l(To]) 

Zip = max{min car + mine 2 a: 2 } (16) 



subject to: A 1 x 1 < ft 1 , A 2 x 2 < b 2 , (17) 
x 1 6 Zl, x 2 e Zl, (18) 

where zt = c 2 , c 1 = c — u. 

In both relaxations in order to find zld or z cs d one has to look for the maximum of a con- 
cave piece-wise linear function. This appeals for using the s.c. sub-gradient optimization tech- 
nique. For the function Zlr{X), the vector s* = b 1 — A 1 x t , where x l is an optimal solution to 

min{cx + \ l (b x — ^l 1 ^)}, is a subgradient at A*. The analog of steepest ascent method for maxi- 

Q 

mizing a function is following subgradient algorithm: 

• (Initialization): Choose a starting point A , 6o and p. Set t — and find a subgradient s*. 

• While s* 7^ and t < iter_ max_ number do { A t+1 = A* + 6(5*; t <— t + 1; find s*} 

This program stops either when s* = 0, (in which case A* is an optimal solution) or after a fixed 
number of iterations. Our scheme for selecting {O*} is 0t = QoP* where < p < 1. 



5 Lagrangian relaxation 

Relying on complexity results from section UH we show now how to apply Lagrangian relaxation- 
takin as the complicating constraints Eqs. ©-0- The constraint yt — Afzik (res yu — A l k Zik) 
insures that y-variables and z- variables select the same position for block i (resp. block k), and 
that is what makes PTP a hard problem. For each link in the contact graph, we decide to relax 
one of the two constraints. This means that for any link (i, k), the position of exactly one of the 
blocks i and k, as specified by z- variables, can be chosen freely. More formally let {L a ,Lp} be a 
partition of the contact graph. The relaxation of PTP is obtained by replacing Eqs. ©-0 with: 

y i =A\z ih (i,k)eL a (19) 

yk=A%Zi k (i,k)eL p (20) 

and changing the objective function to: 

Zlr(X) = min < ^2 c iVi + ^2 dikZik + ^ X lk {y k - A\z ik ) + ^ Xik(Vi ~ A\z lk ) > 

V,Z [i=l (i.fc)ei {i,k)eL a (i,k)eL J 



9 



By rearranging the terms in Q3 



zlr{X) = min < ^ c i (A)y i + d ik (\)z ik 

= 1 (i,k)eL 



with: 



(i,k)€L a (k,i)eLf) 

d ik (X) = d ik - X ^ A \~ E XlkA k ( 22 ) 

(23) 

To make the relaxation tighter, we include monotonicity constraints on z-variables. Denoting 
u = (l,...,n), 

uAfzjfc < tiA^i (i, k) e L (24) 

ui* 1 . i z 1 t 1 <«4 , fe2 z 1 t ! (i,fci),(i,fe 2 ) € L Q , kl< k2 (25) 

wA^z ilfc < uA^Zizk (h,k),{i 2 ,k) € Lp, kl< k2 (26) 



5.1 Solving the relaxation 

Let y £ Y be an optimal path in the alignment graph. Supposing g/y = 1 (block i is at position 
j) we can derive all Zi kl , ■ ■ ■ , Zik r for (i, fci), . . . , («, fc r ) G L Q , by solving: 

9?j = l min ; c^;, + • • • + dy fcr ; r 

This can be computed by a basic dynamic programming routine. More precisely the vector gf 
is obtained by performing the operation (. . . (di kl <8> di k2 ) <8> • • • ) <8> difc r described in paragraph ESI 
Symmetrically, z kli , . . . , z kri for (ki,i), . . . , (fc r ,i) e La, solving: 

gfj = min d kllllj ^ V d kr i rlj 

Then, the optimal relaxed solution can be found by solving the recurrence: 
f(i+\)l = min{/ j; , + c (i+1)i + gf l+1)i + 9 {i+1) i 



6 Cost splitting 

In order to apply the results from the previous section, we need to find a suitable partition of L into 
L 1 (J L 2 ... 1J L* where each L s induces an easy solvable PTP(L S ), and to use the s.c. cost-splitting 
variant of the Lagrangian duality. Now we can restate (QJ- © equivalently as: 



E(E c ^ s + E ( 27 ) 

s=l i=l (i,k)£L s 



(28) 
(29) 
(30) 

(31) 



subject to: y i = y^ 


s = 


2,t 






y s = (yi,..y s m )&Y, 


s = 


1,.. 


.,t 




y- = AiZ ikl y s k = A k z ik 


s = 


1,.. 


.,t 


(i,k) e L s 


„(n+l) 

Zi k e B * 


s = 


1,.. 


.,t 


(i, fc) e L s 
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Figure 5: Cost-Splitting Lagrangi 



Plot of time in seconds with CS-LR algo- 
rithm on the x-axis and the LP algorithm 
from j2] on the y-axis. Both algorithms com- 
pute approximated solutions for 962 thread- 
ing instances associated to the template 
1ASYA0 from the FROST database. The 
linear curve in the plot is the line y = x. 
What is observed is a significant perfor- 
mance gap between the algorithms. For ex- 
ample in a point {x,y) = (0.5,3) CS-LR is 
10 2 ' 5 times faster than LP relaxation. 



Relaxation versus LP Relaxation 



Taking l|28jl as the complicating constraints, we obtain the Lagrangian dual of PTP(L): 

t m t 
v csd = maxmin^(^c l s (A)y l s + ^ d lk z ik ) = max ^ vf° (A) (32) 

V s=l i=l (i,k)eL° s=l 

subject to (f29"| . il3f))l and J3TJ). 

The Lagrangian multipliers X s are associated with the equations i!28|) and c}(X) = c} + J2 s=2 A s , 
cf (A) = cf — X s , s — 2, . . . , t. The coefficients cf are arbitrary (but fixed) decomposition (cost-split) 
of the coefficients Ci, i.e. given by cf = p s Ci with J^Ps = 1- From the Lagrangian duality theory 
follows vi p < v cs d < Vi P . This means that for each PTP instance s.t. vi p = Vi P holds v cs d — Vi P . By 
applying the subgradient optimization technique (QSj) m order to obtain v cs d, one need to solve t 
problems v\ v (A) (see the definition of vf p ) for each A generated during the subgradient iterations. 
As usual, the most time consuming step is PTP(L S ) solving, but we have demonstrated its 0(n 2 ) 
complexity in the case when L s is a union of independent sheaves and single links. More details 
concerning the actual implementation are given in Appendix ??. 



7 Experimental results 

The numerical results presented in this section were obtained on an Intel (R) Xeon(TM) CPU 
2.4 GHz, 2 GB RAM, RedHat 9 Linux. The behavior of the algorithm was tested by computing 
the same distributions as given in 3 (for the purpose of comparison) , plus few extra-large instances 
based on real- life data generated by FROST (Fold Recognition Oriented Search Tool) software 0. 
The MIP models were solved using CPLEX 7.1 solver 0. 

In our first computational experiment we focus on computing score distributions phase and we 
study the quality of the approximated solutions given by three PTP algorithms. Five distributions 
are associated to any 3D template in the FROST database. They are computed by threading the 
template with sets of non related protein sequences having length respectively equal to: -30%, 
-15%, 0%, +15%, +30% of the template length. Any of these sets contains approximately 200 
sequences. 

Hence, computing a score distribution in the FROST database requires solving approximately 
1000 sequence-to-template alignments. Only two values will be finally used: these are the score 
values obtained at the 1st and at the 3rd quartiles of the distribution (denoted respectively by g25 
and (775). FROST uses the following scheme: the raw score (RS) (i.e. the score obtained when a 
given query is aligned with the template) is normalized according to the formula NS = SE|~f • 
Only the value NS (called normalized score) is used to evaluate the relevance of the computed 
raw score to the considered distribution. 
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Figure 6: Plot of time in seconds with CS-LR (Cost-Splitting Lagrangian Relaxation) algorithm 
on the x-axis versus SB-LR (Stefan Balev's Lagrangian Relaxation) algorithm on the y-axis 
concerning score distributions of two templates. Both the x-axis and y-axis are in logarithmic 
scales. The linear curve in the plot is the line y — x. Left: The template 1ASYA (the one 
referenced in [Hj) has been threaded with 962 sequences. Right: 1ALO_0 is one of the templates 
yielding the biggest problem instances when aligned with the 704 sequences associated to it in the 
database. We observe that although CS-LR is often faster than SB-LR, in general the performance 
of both algorithms is very close. 

We conducted the following experiment. For the purpose of this section we chose a set of 12 
non-trivial templates. 60 distributions are associated to them. We first computed these distribu- 
tions using an exact algorithm for solving the underlying PTP problem. The same distributions 
have been afterwords computed using the approximated solutions obtained by any of the three 
algorithms here considered. By approximated solution we mean respectively the following: i) for 
a MIP model this is the solution given by the LP relaxation; ii) for SB-LR (Stefan Balev's La- 
grangian Relaxation) algorithm this is the solution obtained for 500 iterations (the upper bound 
used in p]). Any exit with less than 500 iterations is a sign that the exact value has been found; 
iii) for the Cost-Splitting Lagrangian Relaxation algorithm (CS-LR) this is the solution obtained 
either for 300 iterations or when the relative error between upper and lower bound is less than 
0.001. 

We use the MYZ integer programming model introduced in j2] . It has been proved faster than 
the MIP model used in the package RAPTOR which was well ranked among all non-meta 
servers in CAFASP3 (Third Critical Assessment of Fully Automated Structure Prediction) and in 
CASP6 (Sixth Critical Assessment of Structure Prediction). Because of time limit we present here 
the results from 10 distributions only 3 . Concerning the 1st quartile the relative error between the 
exact and approximated solution is 3 x 10~ 3 in two cases over all 2000 instances and less than 
10 -6 for all other cases. Concerning the 3rd quartile, the relative error is 10 -3 in two cases and 
less than 10 -6 for all other cases. 

All 12125 alignments for the set of 60 templates have been computed by the other two algo- 
rithms. Concerning the 1st quartile, the exact and approximated solution are equal for all cases 
for both (SB-LR and CS-LR) algorithms. Concerning the 3rd quartile and in case of SB-LR al- 
gorithm the exact solution equals the approximated one in all but two cases in which the relative 
error is respectively 10 -3 and 10 -5 . In the same quartile and in case of CS-LR algorithm the 
exact solution equals the approximated one in 12119 instances and the relative error is 7 x 10~ 4 
in only 6 cases. 

Obviously, this loss of precision (due to computing the distribution by not always taking the 
optimal solution) is negligible and does not degrade the quality of the prediction. We therefore 
conclude that the approximated solutions given by any of above mentioned algorithm can be 
successfully used in the score distributions phase. 

Our second numerical experiment concerns running time comparisons for computing approxi- 
mated solutions by LP, SB-LR and CS-LR algorithms. The obtained results are summarized on 
figures El and d Figure clearly shows that CS-LR algorithm significantly outperforms the LP 
relaxation. Figures H3 and illustrate that CS-LR is mostly faster than SB-LR algorithm. Time 

3 More data will be solved and provided for the final version. 
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CS-LR/ S6-LR, running time to compute distributions. 




CPU time in loglO(seconds) tor the CS-LR algorithm 



Plot of time in seconds with CS-LR algo- 
rithm on the x-axis and the SB-LR algo- 
rithm on the y-axis. Each point corresponds 
to the total time needed to compute one dis- 
tribution determined by approximately 200 
alignments of the same size. 61 distributions 
have been computed which needed solving 
totally 12125 alignments. Both the ir-axis 
and y-axis are in logarithmic scales. The 
linear curve in the plot is the line y = x. 
CS-LR is consistently faster than SB-LR al- 
gorithm. 



Figure 7: CS-LR versus SB-LR : recapitulation plot concerning 12125 alignments, 
sensitivity with respect to the size of the problem is given in Fig. 

8 Conclusion 

The results in this paper confirm once more, that integer programming approach is well suited 
to solve protein threading problem. Here, we proposed a cost-splitting approach, and derived a 
new Lagrangian dual formulation for this problem. This approach compares favorably with the 
Lagrangian relaxation proposed in 0. It allows to solve huge instances 4 , with solution space of 
size up to O(10 77 ), within a few minutes. 

The results lead us to think that even better performance could be obtained by relaxing 
additional constraints, relying on the quality of LP bounds. In this manner, the relaxed problem 
will be easier to solve. This is the subject of our current work. 
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